We will work with air pollution data from the U.S. Environmental Protection Agency (EPA). The EPA has a national monitoring network of air pollution sites. The primary question you will answer is whether daily concentrations of PM2.5 (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years (from 2002 to 2022).
Steps
Given the formulated question from the assignment description, you will now conduct EDA Checklist items 2-4. First, download 2002 and 2022 data for all sites in California from the EPA Air Quality Data website. Read in the data using data.table(). For each of the two datasets, check the dimensions, headers, footers, variable names and variable types. Check for any data issues, particularly in the key variable we are analyzing. Make sure you write up a summary of all of your findings.
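The checks listed in this step can be sketched as follows. This is a minimal illustration in base R, not the code used for the results below: the CSV contents are made up, the temporary file stands in for a downloaded EPA file, and with the data.table package `fread()` would play the role of `read.csv()`.

```r
# Hypothetical stand-in for a downloaded EPA file: a tiny CSV in a temp path.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  'Date,"Daily Mean PM2.5 Concentration","Site Name",CBSA_CODE',
  '01/01/2002,12.3,"Site A",31080',
  '01/02/2002,8.1,"Site A",31080',
  '01/01/2002,20.5,"Site B",'
), csv_path)

# check.names = FALSE keeps the original headers, including the spaces.
pm <- read.csv(csv_path, check.names = FALSE, stringsAsFactors = FALSE)

dim(pm)            # number of rows and columns
head(pm, 2)        # first rows (headers)
tail(pm, 2)        # last rows (footers)
names(pm)          # variable names
sapply(pm, class)  # variable types

# Closer look at the key variable
summary(pm$`Daily Mean PM2.5 Concentration`)
```

Running the same checks on each year's file makes it easy to compare the two data sets side by side.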
summary(Otwo[,8:13])
  Site Name         DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
 Length:15976       Min.   :1       Min.   :100      Min.   :88101
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101
 Mode  :character   Median :1       Median :100      Median :88101
                    Mean   :1       Mean   :100      Mean   :88215
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88502
                    Max.   :1       Max.   :100      Max.   :88502
 AQS_PARAMETER_DESC   CBSA_CODE
 Length:15976       Min.   :12540
 Class :character   1st Qu.:23420
 Mode  :character   Median :40140
                    Mean   :33270
                    3rd Qu.:41740
                    Max.   :49700
                    NA's   :929
summary(twotwo[,8:13])
  Site Name         DAILY_OBS_COUNT PERCENT_COMPLETE AQS_PARAMETER_CODE
 Length:57775       Min.   :1       Min.   :100      Min.   :88101
 Class :character   1st Qu.:1       1st Qu.:100      1st Qu.:88101
 Mode  :character   Median :1       Median :100      Median :88101
                    Mean   :1       Mean   :100      Mean   :88196
                    3rd Qu.:1       3rd Qu.:100      3rd Qu.:88101
                    Max.   :1       Max.   :100      Max.   :88502
 AQS_PARAMETER_DESC   CBSA_CODE
 Length:57775       Min.   :12540
 Class :character   1st Qu.:31080
 Mode  :character   Median :40140
                    Mean   :35447
                    3rd Qu.:41860
                    Max.   :49700
                    NA's   :4761
summary(Otwo$`Daily Mean PM2.5 Concentration`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 7.00 12.00 16.12 20.50 104.30
summary(twotwo$`Daily Mean PM2.5 Concentration`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.200 4.200 7.000 8.574 10.900 302.500
twotwo <- twotwo[twotwo$`Daily Mean PM2.5 Concentration` >= 0, ]
mean(is.na(twotwo$CBSA_CODE))
[1] 0.08237052
mean(is.na(Otwo$CBSA_CODE))
[1] 0.05814972
Both data sets have 20 columns, so they should contain the same variables in both years. However, the 2022 data set has far more rows than the 2002 data set (57,775 versus 15,976), meaning more days of data were recorded and/or more sites were added in 2022. The summaries showed extreme values at both ends: the 2022 data have a negative minimum (-2.2) and a very high maximum (302.5), while the 2002 data have a minimum of 0 and a maximum of 104.3. After further research, I found that negative PM2.5 concentrations are invalid but that high values like these are possible, so I removed the values below 0 and kept the high ones. Some CBSA code values were missing (about 5.8% in 2002 and 8.2% in 2022), but I kept those rows because the CBSA code is not relevant to this analysis and all the other relevant values were recorded.
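The cleaning decisions described above can be sketched on a toy data frame. The values are made up for illustration, and `pm25` is a hypothetical short name for the concentration column; only negative readings are dropped, while high values and rows with a missing CBSA code are kept.

```r
# Toy data standing in for one year of EPA records (values are made up).
pm <- data.frame(
  pm25      = c(-2.2, 4.2, 7.0, 10.9, 302.5),
  CBSA_CODE = c(31080, NA, 40140, 41860, NA)
)

n_negative        <- sum(pm$pm25 < 0, na.rm = TRUE)  # count of invalid negatives
prop_negative     <- mean(pm$pm25 < 0, na.rm = TRUE) # share of negative readings
prop_missing_cbsa <- mean(is.na(pm$CBSA_CODE))       # share of missing CBSA codes

# Keep non-negative readings only; the is.na() guard avoids all-NA rows
# that a plain logical subset would create if pm25 itself had NAs.
pm_clean <- pm[!is.na(pm$pm25) & pm$pm25 >= 0, ]
```

Reporting the count and share of removed rows alongside the filter makes it clear how little of the data the decision affects.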
Combine the two years of data into one data frame. Use the Date variable to create a new column for year, which will serve as an identifier. Change the names of the key variables so that they are easier to refer to in your code.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':
between, first, last
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
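The combining step described above can be sketched in base R as follows. The two small data frames are made-up stand-ins for the 2002 and 2022 tables, and the `%m/%d/%Y` date format is an assumption about how the Date column is stored; the new names (`PM2.5`, `SiteName`, `Year`) match the ones used in the plotting code later on.

```r
# Made-up stand-ins for the two yearly tables, with the original headers kept.
Otwo <- data.frame(
  Date = c("01/01/2002", "01/02/2002"),
  `Daily Mean PM2.5 Concentration` = c(12.3, 8.1),
  `Site Name` = c("Site A", "Site A"),
  COUNTY = c("Calaveras", "Calaveras"),
  check.names = FALSE
)
twotwo <- data.frame(
  Date = c("01/01/2022", "01/02/2022"),
  `Daily Mean PM2.5 Concentration` = c(6.4, 5.0),
  `Site Name` = c("Site A", "Site B"),
  COUNTY = c("Calaveras", "Los Angeles"),
  check.names = FALSE
)

# Stack the two years (both tables have the same columns).
merged <- rbind(Otwo, twotwo)

# Parse the date (format is an assumption) and derive the year identifier.
merged$Date <- as.Date(merged$Date, format = "%m/%d/%Y")
merged$Year <- as.integer(format(merged$Date, "%Y"))

# Shorter names for the key variables.
names(merged)[names(merged) == "Daily Mean PM2.5 Concentration"] <- "PM2.5"
names(merged)[names(merged) == "Site Name"] <- "SiteName"

table(merged$Year)  # rows per year, as a quick check on the merge
```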
Create a basic map in leaflet() that shows the locations of the sites (make sure to use different colors for each year). Summarize the spatial distribution of the monitoring sites.
The blue markers are harder to see because 2002 has far fewer records than 2022, but from what I can see, the blue markers (2002) sit mostly along the coast of California and along the eastern border of the state. The red markers (2022) are much more spread out and dominate the map, showing a large increase in the number of sites over the 20 years since 2002.
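A map like the one described can be sketched as below. The coordinates are made up, and the column names `SITE_LATITUDE` and `SITE_LONGITUDE` are assumptions about the EPA file; the leaflet call is wrapped in a check so the site table is still built when the package is not installed.

```r
# Made-up records; several daily rows share the same site and year.
merged <- data.frame(
  SiteName       = c("Site A", "Site A", "Site B", "Site B", "Site C"),
  Year           = c(2002, 2002, 2022, 2022, 2022),
  SITE_LATITUDE  = c(38.20, 38.20, 34.07, 34.07, 36.80),
  SITE_LONGITUDE = c(-120.70, -120.70, -118.23, -118.23, -119.80)
)

# One row per site per year, so repeated daily records do not stack markers.
sites <- unique(merged[, c("SiteName", "Year", "SITE_LATITUDE", "SITE_LONGITUDE")])
sites$Year <- factor(sites$Year)

table(sites$Year)  # number of distinct sites per year

if (requireNamespace("leaflet", quietly = TRUE)) {
  library(leaflet)
  # Blue for 2002, red for 2022, matching the description above.
  pal <- colorFactor(c("blue", "red"), domain = levels(sites$Year))
  site_map <- leaflet(sites) %>%
    addTiles() %>%
    addCircleMarkers(lng = ~SITE_LONGITUDE, lat = ~SITE_LATITUDE,
                     color = ~pal(Year), radius = 4, opacity = 1) %>%
    addLegend(pal = pal, values = ~Year, title = "Year")
}
```

Counting distinct sites per year with `table()` gives a number to put next to the visual impression from the map.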
Check for any missing or implausible values of PM2.5 in the combined dataset. Explore the proportions of each and provide a summary of any temporal patterns you see in these observations.
summary(merged$PM2.5)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 4.60 7.70 10.24 12.50 302.50
hist(merged$PM2.5)
There were no missing values of PM2.5. However, there were some outliers on the extreme high end: the merged data set has a first quartile of 4.6, a median of 7.7, and a third quartile of 12.5, but a maximum of 302.5, so most of the data sit at low values. To explore this pattern, I made a histogram. Most of the data (over 80%) lie below 25, with some between 25 and 60, and the outliers barely show up on the histogram because only about 15-20 of them are in the hundreds, out of 73,533 values in the data set.
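The proportions and temporal patterns asked for in this step can be sketched as follows. The data are made up, and the cutoff of 100 for an "extreme" value is an illustrative assumption rather than an official threshold.

```r
# Made-up daily records with one missing value and two extreme values.
merged <- data.frame(
  Date  = as.Date(c("2002-01-05", "2002-07-10", "2022-01-08",
                    "2022-09-01", "2022-09-02", "2022-09-03")),
  PM2.5 = c(16.1, NA, 7.0, 150.2, 302.5, 9.8)
)
merged$Year  <- as.integer(format(merged$Date, "%Y"))
merged$Month <- as.integer(format(merged$Date, "%m"))

threshold <- 100  # illustrative cutoff for "extreme" (an assumption)

prop_missing  <- mean(is.na(merged$PM2.5))                  # share of NAs
prop_negative <- mean(merged$PM2.5 < 0, na.rm = TRUE)       # share below zero
prop_extreme  <- mean(merged$PM2.5 > threshold, na.rm = TRUE)

# Temporal pattern: in which year and month do the extreme values fall?
extreme <- merged[!is.na(merged$PM2.5) & merged$PM2.5 > threshold, ]
extreme_by_month <- table(extreme$Year, extreme$Month)
```

Tabulating the extreme values by year and month shows whether they cluster in particular periods rather than being spread evenly through the year.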
Explore the main question of interest (whether daily concentrations of PM2.5 (particulate matter air pollution with aerodynamic diameter less than 2.5 \(\mu\)m) have decreased in California over the last 20 years, from 2002 to 2022) at three different spatial levels.
Create exploratory plots (e.g. boxplots, histograms, line plots) and summary statistics that best suit each level of data. Be sure to write up explanations of what you observe in these data.
state
county
site in Los Angeles
library(ggplot2)

# County level: histogram of daily PM2.5 in Calaveras County, by year
pm25county <- merged %>%
  filter(COUNTY == 'Calaveras' & (Year == 2002 | Year == 2022))
ggplot(data = pm25county, aes(x = PM2.5, fill = as.factor(Year))) +
  geom_histogram(binwidth = 2, position = "dodge") +
  labs(x = "PM2.5", y = "Frequency", fill = "Year",
       title = paste("PM2.5 rates in Calaveras for 2002 vs 2022")) +
  theme_minimal()

# Site level: daily PM2.5 at Los Angeles-North Main Street, by year
pm25site <- merged %>%
  filter(SiteName == 'Los Angeles-North Main Street' & (Year == 2002 | Year == 2022))
ggplot(data = pm25site, aes(x = Year, y = PM2.5, group = 1)) +
  geom_line() +
  labs(x = "Year", y = "PM2.5",
       title = paste("PM2.5 rates in Los Angeles-N. Main St. for 2002 and 2022")) +
  theme_minimal()

# State level: box plot of daily PM2.5 across California, by year
pm25state <- merged %>%
  filter(Year %in% c(2002, 2022))
ggplot(data = pm25state, aes(x = as.factor(Year), y = PM2.5)) +
  geom_boxplot(fill = "lightblue", color = "red") +
  labs(x = "Year", y = "PM2.5",
       title = "Box Plot of PM2.5 levels in California for 2002 and 2022") +
  scale_y_continuous(breaks = seq(0, max(merged$PM2.5), by = 50)) +
  theme_minimal()
In Calaveras County, PM2.5 concentrations have decreased somewhat: the highest values, near 30 and 40 \(\mu\)g/m\(^3\), are from 2002, whereas the highest 2022 value is about 27 \(\mu\)g/m\(^3\). There is a much larger volume of data for 2022, but most of it is below 12 \(\mu\)g/m\(^3\), which is considered healthy by most studies.
At the Los Angeles-North Main Street site, PM2.5 concentrations have decreased substantially: the highest values, over 60 \(\mu\)g/m\(^3\), are from 2002, whereas the highest 2022 value is about 38 \(\mu\)g/m\(^3\). As noted above, there is a much larger volume of data for 2022.
Across the state of California, the maximum PM2.5 values have increased: in 2022 the extreme values exceed 150 \(\mu\)g/m\(^3\) and reach over 300 \(\mu\)g/m\(^3\). However, looking at the box itself (the interquartile range, excluding the extreme values), Q1, the median, and Q3 are all lower in 2022 than they were in 2002. So, for the bulk of the data, PM2.5 concentrations in California were lower in 2022 than in 2002.
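The quartile comparison behind these statements can be sketched with a grouped summary. The numbers below are made up to mimic the pattern described (a lower median but a higher maximum in 2022); the same function could be applied to a county or single-site subset.

```r
# Made-up daily values: 2022 has a lower bulk but one very high reading.
merged <- data.frame(
  Year  = rep(c(2002, 2022), each = 5),
  PM2.5 = c(7, 12, 16, 20, 104,
            4, 7, 9, 11, 302)
)

# Quartiles and maximum for each year; columns are years, rows are statistics.
stats <- sapply(split(merged$PM2.5, merged$Year), function(x) {
  c(q1     = unname(quantile(x, 0.25)),
    median = median(x),
    q3     = unname(quantile(x, 0.75)),
    max    = max(x))
})
stats
```

Putting the maximum next to the quartiles shows how a few extreme days can pull the maximum up even when typical daily concentrations have gone down.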